Inverse Document Frequency (IDF): A Measure of Deviations from Poisson

Authors

  • Kenneth Church
  • William Gale
Abstract

Low frequency words tend to be rich in content, and vice versa. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like "somewhat" and "boycott". Both "somewhat" and "boycott" appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but "boycott" is a better keyword because its IDF is farther from what would be expected by chance (Poisson).

1. Document frequency is similar to word frequency, but different

Word frequency is commonly used in all sorts of natural language applications. The practice implicitly assumes that words (and n-grams) follow a single-parameter distribution such as a Poisson or a Binomial. But we find that these distributions do not fit the data very well. Both the Poisson and the Binomial assume that the variance over documents is no larger than the mean, and yet we find that it can be quite a bit larger, especially for interesting words such as "boycott", where hidden variables such as topic conspire to undermine the independence assumption behind the Poisson and the Binomial. Much better fits are obtained by introducing a second parameter such as inverse document frequency (IDF).

Inverse document frequency (IDF) is commonly used in Information Retrieval (Sparck Jones, 1972). IDF is defined as $-\log_2(df_w / D)$, where $D$ is the number of documents in the collection and $df_w$ is the document frequency, the number of documents that contain $w$. Obviously, there is a strong relationship between document frequency, $df_w$, and word frequency, $f_w$. The relationship is shown in Figure 1, a plot of $\log_{10} f_w$ and IDF for 193 words selected from a 50 million word corpus of 1989 Associated Press (AP) Newswire stories ($D = 85{,}432$ stories). Although $\log_{10} f_w$ is highly correlated with IDF ($\rho = -0.994$), it would be a mistake to assume that the two variables are completely predictable from one another.
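The definition of IDF above can be illustrated with a short Python sketch. The four-document toy corpus, the whitespace tokenizer, and the function names `document_frequencies` and `idf` are assumptions made for illustration; they are not taken from the paper, which worked with the 1989 AP corpus.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each word, the number of documents that contain it (df_w)."""
    df = Counter()
    for doc in docs:
        # A set ensures each word is counted at most once per document.
        df.update(set(doc.lower().split()))
    return df


def idf(doc_freq, num_docs):
    """IDF = -log2(df_w / D), as defined in the text."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must satisfy 0 < df_w <= D")
    return -math.log2(doc_freq / num_docs)


# Hypothetical toy corpus (illustrative only).
docs = [
    "the boycott began and the boycott spread",
    "the vote was somewhat close",
    "the weather was somewhat mild",
    "the market was calm",
]

df = document_frequencies(docs)
D = len(docs)
for word in ("the", "somewhat", "boycott"):
    print(word, df[word], idf(df[word], D))
```

In this toy corpus "boycott" and "somewhat" each occur twice, yet "boycott" is concentrated in one document and so receives the higher IDF, which mirrors the distinction the abstract draws between the two words.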
Indeed, the experience of the Information Retrieval community has indicated that IDF is a very useful quantity. Attempts to replace IDF with $f_w$ (or some simple transform of $f_w$) have not been very successful. Figure 2 shows one such attempt. It compares the observed IDF with $\widehat{\mathrm{IDF}}$, an estimate based on $f_w$. Assume that a document is merely a "bag of words" with no interesting structure (content). Words are randomly generated by a Poisson process, $\pi$. The probability of $k$ instances of a word $w$ in a document is $\pi(\theta, k)$, where $\theta = f_w / D$.
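A minimal Python sketch of how a Poisson-based IDF estimate could be computed from word frequency alone. The excerpt breaks off before giving the formula, so the steps here are assumptions: the Poisson rate is taken to be word frequency divided by the number of documents, the chance that a document contains the word at least once is taken to be one minus the Poisson probability of zero occurrences, and the estimate is the IDF implied by that probability. The function names are hypothetical.

```python
import math


def poisson_pmf(theta, k):
    """Standard Poisson probability of exactly k occurrences at rate theta."""
    return math.exp(-theta) * theta ** k / math.factorial(k)


def predicted_idf(word_freq, num_docs):
    """IDF implied by a Poisson model with rate theta = f_w / D (assumed form)."""
    theta = word_freq / num_docs
    # P(at least one occurrence) = 1 - exp(-theta); expm1 keeps precision
    # when theta is small.
    p_at_least_one = -math.expm1(-theta)
    return -math.log2(p_at_least_one)


# Figures quoted in the text: about 1000 occurrences, D = 85,432 stories.
D = 85_432
f_w = 1_000
print(predicted_idf(f_w, D))
```

Under this sketch, a word whose observed IDF lies well above the printed value occurs in fewer documents than a Poisson model would predict, which is the kind of deviation the abstract associates with good keywords such as "boycott".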




Publication date: 1995